Machine Learning Notes

Table of Contents

Statistical Learning Methods
Overview of Statistical Learning Methods
Basic Concepts
The Three Elements of Statistical Learning
Model Evaluation and Model Selection
Machine Learning
Loading Datasets
Loading Data
Splitting the Dataset
Learning and Prediction
Example
Saving a Model
Nearest Neighbors
Plotting the Decision Boundary
Model Selection: Choosing the Model and Its Parameters
The Curse of Dimensionality
Perceptron
K-Nearest Neighbors
Linear Models
Logistic Regression (LogisticRegression)
AdaBoost
Support Vector Machines (SVM)
Clustering
Decompositions: Matrix Factorization and Dimensionality Reduction
Pipelining
Type Casting
1. Unless otherwise specified, the input is cast to float64
2. Regression targets are cast to float64; classification targets are kept as-is
Numpy
1. array vs list
2. np.unique(iris_y)
Pandas
Example
Working With Text Data - 20 newsgroups dataset


Statistical Learning Methods

Overview of Statistical Learning Methods

Basic Concepts

The data are assumed to be generated i.i.d. from a joint probability distribution $P(X, Y)$. That $X$ and $Y$ have a joint probability distribution is the basic assumption of supervised learning.

A model in the hypothesis space can be represented in one of two ways: as a decision function $Y = f(X)$ or as a conditional probability distribution $P(Y \mid X)$.

The Three Elements of Statistical Learning

Method = model + strategy + algorithm

Model

The hypothesis space can be defined as a set of decision functions $\mathcal{F} = \{ f \mid Y = f_\theta(X),\ \theta \in \mathbf{R}^n \}$, or as a set of conditional probability distributions $\mathcal{F} = \{ P \mid P_\theta(Y \mid X),\ \theta \in \mathbf{R}^n \}$. The parameter vector $\theta$ takes values in the n-dimensional parameter space.

Strategy

The loss function is usually written $L(Y, f(X))$; commonly used loss functions are:

0-1 loss: $L(Y, f(X)) = \begin{cases} 1, & Y \neq f(X) \\ 0, & Y = f(X) \end{cases}$

Squared loss: $L(Y, f(X)) = (Y - f(X))^2$

Absolute loss: $L(Y, f(X)) = |Y - f(X)|$

Log loss (log-likelihood loss): $L(Y, P(Y \mid X)) = -\log P(Y \mid X)$

Exponential loss: $L(Y, f(X)) = \exp(-Y f(X))$

Comparison plot of the loss functions ###
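
A minimal NumPy sketch of these losses (illustrative only; y, f_x and p_y_given_x below are made-up arrays, with sign(f(x)) used as the predicted label for the 0-1 loss):

import numpy as np

y = np.array([1, -1, 1, 1])             # hypothetical true labels in {-1, +1}
f_x = np.array([0.8, 0.3, -0.2, 2.0])   # hypothetical model outputs f(x)

zero_one = (np.sign(f_x) != y).astype(float)   # 0-1 loss, using sign(f(x)) as the prediction
squared = (y - f_x) ** 2                       # squared loss
absolute = np.abs(y - f_x)                     # absolute loss
exponential = np.exp(-y * f_x)                 # exponential loss

p_y_given_x = np.array([0.9, 0.6, 0.4, 0.95])  # hypothetical P(Y | X) assigned to the true label
log_loss = -np.log(p_y_given_x)                # log-likelihood loss

print(zero_one, squared, absolute, exponential, log_loss)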

 

Expected risk: $R_{\mathrm{exp}}(f) = E_P[L(Y, f(X))] = \int L(y, f(x))\, P(x, y)\, \mathrm{d}x\, \mathrm{d}y$

Empirical risk: $R_{\mathrm{emp}}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))$

Structural risk: $R_{\mathrm{srm}}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i)) + \lambda J(f)$

Structural risk minimization was proposed to prevent overfitting and is equivalent to regularization. $J(f)$ measures the complexity of the model: the more complex the model, the larger $J(f)$; the simpler the model, the smaller $J(f)$. Maximum a posteriori (MAP) estimation in Bayesian inference is an example of structural risk minimization.
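
A brief sketch of why MAP estimation fits this framework (a standard argument under the usual i.i.d. assumptions, not taken from the original notes): maximizing the posterior over models is the same as minimizing a log-loss empirical risk plus a complexity penalty coming from the prior,

$\hat{f}_{\mathrm{MAP}} = \arg\max_f P(f \mid D) = \arg\max_f P(D \mid f)\, P(f) = \arg\min_f \Big[ -\sum_{i=1}^{N} \log P(y_i \mid x_i, f) - \log P(f) \Big]$,

which has exactly the structural-risk form above, with the log loss as $L$ and $-\log P(f)$ playing the role of $\lambda J(f)$.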

Algorithm

Based on the strategy, select the optimal model (i.e. the optimal parameters $\theta^*$) from the hypothesis space.

Model Evaluation and Model Selection

Training error and test error

Training error: $R_{\mathrm{emp}}(\hat{f}) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}(x_i))$ over the training set; test error: $e_{\mathrm{test}} = \frac{1}{N'} \sum_{i=1}^{N'} L(y_i, \hat{f}(x_i))$ over the test set. The error is computed with a particular loss function, e.g. the 0-1 loss or the squared loss.

 

 

Machine Learning

Loading Datasets

Loading Data

from sklearn import datasets

iris = datasets.load_iris()

Splitting the Dataset

Random permutation

np.random.seed(0)
indices = np.random.permutation(len(iris_X))   # random permutation of the sample indices
iris_X_train = iris_X[indices[:-10]]
iris_y_train = iris_y[indices[:-10]]
iris_X_test = iris_X[indices[-10:]]
iris_y_test = iris_y[indices[-10:]]

 

Learning and Prediction

Example

from sklearn import svm

clf = svm.SVC(gamma=0.001, C=100)
clf.fit(digits.data[:-1], digits.target[:-1])
print(clf)
print(clf.predict(digits.data[-1:]))

 

Updating parameters (sklearn.pipeline.Pipeline.set_params):

clf.set_params(kernel='linear').fit(X, y)  
clf.set_params(kernel='rbf').fit(X, y) 

 

Saving a Model

import pickle

s = pickle.dumps(clf)
clf2 = pickle.loads(s)
print(digits.target[-2])
print(clf2.predict(digits.data[-2:]))

# Alternative: persist the model with joblib
# (in newer scikit-learn versions, use `import joblib` directly)
from sklearn.externals import joblib

joblib.dump(clf, 'filename.pkl')
clf = joblib.load('filename.pkl')

 

Nearest Neighbors

from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier()
>>> knn.fit(iris_X_train, iris_y_train) 

 

Plotting the Decision Boundary

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets, neighbors

# Data and parameters as in the scikit-learn nearest-neighbors example this snippet
# follows: the first two iris features, k = 15, mesh step h = .02
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
n_neighbors = 15
h = .02

# Color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
for weights in ['uniform', 'distance']:
    # we create an instance of Neighbours Classifier and fit the data.
    clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X, y)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max] x [y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    print(xx)
    # Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = clf.predict(np.array([xx.ravel(), yy.ravel()]).T)
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    print(xx.shape)
    plt.figure()
    # Fill the mesh with the predicted class colors
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
    # plt.contourf(xx, yy, Z, alpha=0.4, cmap=cmap_light)
    # Plot the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i, weights = '%s')" % (n_neighbors, weights))

plt.show()

Model Selection: Choosing the Model and Its Parameters

Score, and cross-validated scores

Every estimator exposes a score method that judges the quality of the fit (or of the prediction) on new data: bigger is better.

Cross-validation

kfold = cross_validation.KFold(len(X_digits), n_folds=3)

>>> [svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test]) for train, test in kfold]

[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]

 

>>> cross_validation.cross_val_score(svc, X_digits, y_digits, cv=kfold, n_jobs=-1)

array([ 0.93489149,  0.95659432,  0.93989983])

 

Cross-validation generators

KFold(n, k): split into K folds, train on K-1 of them, then test on the left-out fold.

StratifiedKFold(y, k): like KFold, but preserves the class ratios / label distribution within each fold.

LeaveOneOut(n): leave one observation out.

LeaveOneLabelOut(labels): takes a label array to group observations.
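
A small sketch of the difference between KFold and StratifiedKFold on an imbalanced toy label vector. This uses the newer sklearn.model_selection API; the snippets above use the older sklearn.cross_validation module, where the equivalent constructors are KFold(n, n_folds=k) and StratifiedKFold(y, n_folds=k).

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.zeros((12, 2))
y = np.array([0] * 8 + [1] * 4)   # imbalanced toy labels (2:1 ratio)

for cv in (KFold(n_splits=3), StratifiedKFold(n_splits=3)):
    print(cv.__class__.__name__)
    for train, test in cv.split(X, y):
        # StratifiedKFold keeps roughly the 2:1 label ratio in every test fold
        print('  test label counts:', np.bincount(y[test]))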

 

Grid-search and cross-validated estimators

from sklearn.grid_search import GridSearchCV

Cs = np.logspace(-6, -1, 10)

clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs),n_jobs=-1)

>>> clf.fit(X_digits[:1000], y_digits[:1000])

GridSearchCV(cv=None,...

>>> clf.best_score_                                 

0.925...

>>> clf.best_estimator_.C                           

0.0077...

>>> # Prediction performance on test set is not as good as on train set

>>> clf.score(X_digits[1000:], y_digits[1000:])     

0.943...

By default, GridSearchCV uses 3-fold cross-validation. However, if it is given a classifier rather than a regressor, it uses StratifiedKFold so that the label proportions are the same in each fold.

Cross-validated estimators

from sklearn import linear_model, datasets

lasso = linear_model.LassoCV()

diabetes = datasets.load_diabetes()

X_diabetes = diabetes.data

y_diabetes = diabetes.target

lasso.fit(X_diabetes, y_diabetes)

LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,

    max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,

    precompute='auto', random_state=None, selection='cyclic', tol=0.0001,

    verbose=False)

>>> # The estimator chose automatically its lambda:

>>> lasso.alpha_

0.01229...

 

The Curse of Dimensionality

First, error decomposes into bias and variance: Error = Bias + Variance (for squared error, more precisely squared bias plus variance plus irreducible noise).

Error reflects the accuracy of the whole model. Bias reflects the gap between the model's outputs on the samples and the true values, i.e. how accurate the model itself is. Variance reflects the gap between individual model outputs and the expected model output, i.e. how stable the model is.

A loose rule of thumb for simple data: N = 10 * d samples.

A strict requirement (densely covering the space): N = 10 ^ d samples, where d is the number of dimensions.
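
A one-line sketch of how quickly the strict requirement outgrows the loose rule of thumb (this simply evaluates the two formulas above for a few values of d):

for d in range(1, 8):
    print(d, 10 * d, 10 ** d)   # dimension, loose rule N = 10*d, strict rule N = 10^d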

 

感知器

Perceptron Learning Algorithm

The relationship between PLA and SGD

Perceptron and SGDClassifier share the same underlying implementation. In fact, Perceptron() is equivalent to SGDClassifier(loss="perceptron", eta0=1, learning_rate="constant", penalty=None).
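
A minimal sketch checking this equivalence on the iris data (the settings are illustrative; exact agreement assumes the two estimators share the same max_iter, tol, shuffle and random_state, and defaults differ across scikit-learn versions):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron, SGDClassifier

X, y = load_iris(return_X_y=True)

p = Perceptron(max_iter=10, tol=None, shuffle=False, random_state=0).fit(X, y)
s = SGDClassifier(loss="perceptron", eta0=1, learning_rate="constant", penalty=None,
                  max_iter=10, tol=None, shuffle=False, random_state=0).fit(X, y)

print(np.mean(p.predict(X) == s.predict(X)))  # expected to be 1.0, i.e. identical predictions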

K-Nearest Neighbors

  1. There is no explicit training process; classification is done by majority vote among the k nearest neighbors.
  2. The model corresponds to a partition of the feature vector space induced by the training data.
  3. Three elements: the distance metric, the choice of k, and the classification decision rule.
  4. The larger k is, the simpler the model; the smaller k is, the more complex the model and the more easily it overfits.
  5. The majority-vote rule is equivalent to empirical risk minimization.
  6. A linear scan is too slow, so a kd-tree is used to speed up the search (requires some data-structures background); see the sketch below.
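
A minimal sketch of forcing the kd-tree backend in scikit-learn's k-NN classifier (the dataset and the n_neighbors value are illustrative):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')  # kd-tree instead of brute-force search
knn.fit(X, y)
print(knn.score(X, y))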

 

Linear Models

from sklearn import linear_model

regr = linear_model.LinearRegression()

>>> regr.fit(diabetes_X_train, diabetes_y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

>>> print(regr.coef_)

[   0.30349955 -237.63931533  510.53060544  327.73698041 -814.13170937

  492.81458798  102.84845219  184.60648906  743.51961675   76.09517222]

 

>>> # The mean squared error

>>> np.mean((regr.predict(diabetes_X_test)-diabetes_y_test)**2)

2004.56760268...

>>> # Explained variance score: 1 is perfect prediction

>>> # and 0 means that there is no linear relationship

>>> # between X and Y (the explained variance / R^2 score).

>>> regr.score(diabetes_X_test, diabetes_y_test)

0.5850753022690...

 

Shrinkage

This situation arises when there are few data points per dimension and the noise has high variance.

# X, y and test are assumed to be the tiny two-point dataset from the scikit-learn
# tutorial, e.g. X = np.c_[.5, 1].T, y = [.5, 1], test = np.c_[0, 2].T
regr = linear_model.LinearRegression()
regr.fit(X, y)
pl.plot(test, regr.predict(test))

 

 

 

Solution:

Shrink the regression coefficients toward zero: the bias that ridge regression introduces is in fact a form of regularization. Fitting the regularities of the noise, so that the model no longer generalizes to new data, is called overfitting.

regr = linear_model.Ridge(alpha=.1)

pl.figure()

 

np.random.seed(0)

for _ in range(6):

    this_X = .1*np.random.normal(size=(2, 1)) + X

    regr.fit(this_X, y)

    pl.plot(test, regr.predict(test))

    pl.scatter(this_X, y, s=3)

 

alphas = np.logspace(-4, -1, 6)

from __future__ import print_function

>>> print([regr.set_params(alpha=alpha).fit(diabetes_X_train, diabetes_y_train,).score(diabetes_X_test, diabetes_y_test) for alpha in alphas])

[0.5851110683883..., 0.5852073015444..., 0.5854677540698..., 0.5855512036503..., 0.5830717085554..., 0.57058999437...]

 

A typical bias/variance tradeoff: a large ridge alpha parameter gives high bias and low variance.

 

Sparsity

Example: the diabetes dataset involves 11 dimensions (10 features plus the target), so it is hard to extract useful information from it by visualization alone. It may be important to keep in mind that the data could occupy a rather empty region of that space.

Sparsity means keeping only the informative features and setting the coefficients of uninformative features to zero. Ridge regression shrinks coefficients but not all the way to zero; Lasso (least absolute shrinkage and selection operator) sets some of them exactly to zero. Such approaches are called sparse methods, and sparsity can be seen as an application of Occam's razor: prefer simpler models.

regr = linear_model.Lasso()

scores = [regr.set_params(alpha=alpha).fit(diabetes_X_train, diabetes_y_train).score(diabetes_X_test, diabetes_y_test) for alpha in alphas]

best_alpha = alphas[scores.index(max(scores))]

regr.alpha = best_alpha

>>> regr.fit(diabetes_X_train, diabetes_y_train)

Lasso(alpha=0.025118864315095794, copy_X=True, fit_intercept=True,

   max_iter=1000, normalize=False, positive=False, precompute=False,

   random_state=None, selection='cyclic', tol=0.0001, warm_start=False)

>>> print(regr.coef_)

[   0.         -212.43764548  517.19478111  313.77959962 -160.8303982    -0.

 -187.19554705   69.38229038  508.66011217   71.84239008]

Lasso uses a coordinate descent method, which is efficient on large datasets.

LassoLars uses the LARS algorithm, which is very efficient for problems in which the estimated weight vector is very sparse.
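
A minimal sketch comparing the two solvers on the same data (assumptions: the training split keeps the last 20 samples out, as in the scikit-learn tutorial, and the alpha value is purely illustrative):

from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
diabetes_X_train = diabetes.data[:-20]
diabetes_y_train = diabetes.target[:-20]

lasso = linear_model.Lasso(alpha=0.025).fit(diabetes_X_train, diabetes_y_train)
lars = linear_model.LassoLars(alpha=0.025).fit(diabetes_X_train, diabetes_y_train)

print(lasso.coef_)   # coordinate-descent solution
print(lars.coef_)    # LARS solution; both are sparse and should be very close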

 

In LogisticRegression, C controls the amount of regularization: a larger value of C results in less regularization. penalty="l2" gives shrinkage (i.e. non-sparse coefficients), while penalty="l1" gives sparsity.
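
A minimal sketch of the effect of the penalty and of C on the iris data (the values are illustrative; the 'liblinear' solver is chosen because it supports the l1 penalty):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

l2 = LogisticRegression(penalty='l2', C=1e5, solver='liblinear').fit(X, y)   # weak l2 regularization
l1 = LogisticRegression(penalty='l1', C=0.1, solver='liblinear').fit(X, y)   # stronger l1 regularization

print(np.sum(l2.coef_ == 0))   # l2: coefficients are shrunk but typically none is exactly zero
print(np.sum(l1.coef_ == 0))   # l1: several coefficients are driven exactly to zero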

Logistic Regression (LogisticRegression)

Example: for the iris task, linear regression is not the right approach, because it gives too much weight to data far from the decision boundary. A better approach is to fit a logistic function (a sigmoid, which is much closer to a step function).


 

AdaBoost

The boosting method: the AdaBoost algorithm

Algorithm (AdaBoost)
Input: a training set $T=\{(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)\}$ with $y_i \in \{-1,+1\}$, and a weak learning algorithm;
Output: the final classifier $G(x)$.
(1) Initialize the weight distribution of the training data:
$D_1=(w_{11},\ldots,w_{1i},\ldots,w_{1N})$, $w_{1i}=\frac{1}{N}$, $i=1,2,\ldots,N$

(2) For $m=1,2,\ldots,M$:
(a) Learn a base classifier $G_m(x)$ from the training data weighted by $D_m$.
(b) Compute the classification error rate of $G_m(x)$ on the weighted training set:
$e_m=P(G_m(x_i)\neq y_i)=\sum_{i=1}^{N} w_{mi}\, I(G_m(x_i)\neq y_i)$
(c) Compute the coefficient of $G_m(x)$:
$\alpha_m=\frac{1}{2}\ln\frac{1-e_m}{e_m}$, where the logarithm is the natural logarithm.
(d) Update the weight distribution of the training data:
$D_{m+1}=(w_{m+1,1},\ldots,w_{m+1,i},\ldots,w_{m+1,N})$
$w_{m+1,i}=\frac{w_{mi}}{Z_m}\exp(-\alpha_m y_i G_m(x_i))$, $i=1,2,\ldots,N$

Here $Z_m=\sum_{i=1}^{N} w_{mi}\exp(-\alpha_m y_i G_m(x_i))$ is a normalization factor that makes $D_{m+1}$ a probability distribution.
(3) Build the linear combination of base classifiers $f(x)=\sum_{m=1}^{M}\alpha_m G_m(x)$ and obtain the final classifier $G(x)=\operatorname{sign}(f(x))$.

Some remarks on the AdaBoost algorithm:
Step (1) assumes a uniform weight distribution over the training data, so that every training sample plays the same role when the first base classifier is learned; this guarantees that $G_1(x)$ is learned on the original data.
Step (2): AdaBoost learns base classifiers repeatedly, performing the following operations in each round $m=1,2,\ldots,M$:
(a) Learn the base classifier $G_m(x)$ from the training data weighted by the current distribution $D_m$.
(b) Compute the classification error rate of $G_m(x)$ on the weighted training set, $e_m=P(G_m(x_i)\neq y_i)=\sum_{i=1}^{N} w_{mi}\, I(G_m(x_i)\neq y_i)$, which shows how the error rate of $G_m(x)$ depends on the weight distribution $D_m$.
(c) Compute the coefficient $\alpha_m$ of $G_m(x)$; it expresses the importance of $G_m(x)$ in the final classifier. When $e_m<0.5$ we have $\alpha_m>0$, and $\alpha_m$ increases as $e_m$ decreases, so base classifiers with smaller error rates play a larger role in the final classifier.
(d) Update the weight distribution of the training data to prepare for the next round:
$w_{m+1,i}=\begin{cases}\frac{w_{mi}}{Z_m}e^{-\alpha_m}, & G_m(x_i)=y_i \\ \frac{w_{mi}}{Z_m}e^{\alpha_m}, & G_m(x_i)\neq y_i\end{cases}$

Thus the weights of samples misclassified by the base classifier are enlarged, while the weights of correctly classified samples are shrunk; relative to correctly classified samples, the weights of misclassified samples are magnified by a factor of $e^{2\alpha_m}=\frac{1-e_m}{e_m}$, so misclassified samples play a larger role in the next round. AdaBoost does not change the training data themselves; it keeps changing the weight distribution over the training data, so that the same data play different roles when the base classifiers are learned. This is one characteristic of AdaBoost.
Step (3): the linear combination $f(x)$ realizes a weighted vote among the $M$ base classifiers. The coefficient $\alpha_m$ expresses the importance of $G_m(x)$; note that the $\alpha_m$ do not sum to 1. The sign of $f(x)$ decides the class of instance $x$, and its absolute value expresses the confidence of the classification. Building the final classifier as a linear combination of base classifiers is the other characteristic of AdaBoost.
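
A minimal scikit-learn sketch of the algorithm above (the dataset and hyperparameter values are illustrative; algorithm='SAMME' is requested because it corresponds to the discrete AdaBoost described here, and by default the weak learner is a depth-1 decision tree; newer scikit-learn versions may warn that the algorithm parameter is deprecated):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)

clf = AdaBoostClassifier(n_estimators=50, learning_rate=1.0,
                         algorithm='SAMME', random_state=0)
clf.fit(X, y)

print(clf.score(X, y))
print(clf.estimator_weights_[:5])   # scikit-learn's analogue of the alpha_m coefficients
print(clf.estimator_errors_[:5])    # the corresponding weighted error rates e_m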

Support Vector Machines (SVM)

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm

iris = datasets.load_iris()
X = iris.data
y = iris.target

X = X[y != 0, :2]
y = y[y != 0]

n_sample = len(X)

np.random.seed(0)
order = np.random.permutation(n_sample)
X = X[order]
y = y[order].astype(float)

n_train = int(.9 * n_sample)  # slice indices must be integers, so cast explicitly
X_train = X[:n_train]
y_train = y[:n_train]
X_test = X[n_train:]
y_test = y[n_train:]

# fit the model
for fig_num, kernel in enumerate(('linear', 'rbf', 'poly')):
    clf = svm.SVC(kernel=kernel, gamma=10)
    clf.fit(X_train, y_train)

    plt.figure(fig_num)
    plt.clf()
    plt.scatter(X[:, 0], X[:, 1], c=y, zorder=10, cmap=plt.cm.Paired)

    # Circle out the test data
    plt.scatter(X_test[:, 0], X_test[:, 1], s=80, facecolors='none', zorder=10)

    plt.axis('tight')
    x_min = X[:, 0].min()
    x_max = X[:, 0].max()
    y_min = X[:, 1].min()
    y_max = X[:, 1].max()

    XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
    Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(XX.shape)
    plt.pcolormesh(XX, YY, Z > 0, cmap=plt.cm.Paired)
    plt.contour(XX, YY, Z, colors=['k', 'k', 'k'], linestyles=['--', '-', '--'],
                levels=[-.5, 0, .5])

    plt.title(kernel)
plt.show()

Clustering

k-means

Hierarchical agglomerative clustering: Ward

Agglomerative  - bottom-up

Divisive - top-down

import time

import numpy as np
import scipy as sp

from sklearn.feature_extraction.image import grid_to_graph
from sklearn.cluster import AgglomerativeClustering

# Generate data
# (sp.misc.lena() has been removed from newer SciPy releases; sp.misc.face(gray=True)
#  can be used as a drop-in substitute image)
lena = sp.misc.lena()
# Downsample the image by a factor of 4
lena = lena[::2, ::2] + lena[1::2, ::2] + lena[::2, 1::2] + lena[1::2, 1::2]
X = np.reshape(lena, (-1, 1))

###############################################################################
# Define the structure A of the data. Pixels connected to their neighbors.
connectivity = grid_to_graph(*lena.shape)

###############################################################################
# Compute clustering
print("Compute structured hierarchical clustering...")
st = time.time()
n_clusters = 15  # number of regions
ward = AgglomerativeClustering(n_clusters=n_clusters,
        linkage='ward', connectivity=connectivity).fit(X)
label = np.reshape(ward.labels_, lena.shape)
print("Elapsed time: ", time.time() - st)
print("Number of pixels: ", label.size)
print("Number of clusters: ", np.unique(label).size)

 

Connectivity-constrained clustering

 

Feature agglomeration: merging features to reduce dimensionality

We have seen that sparsity can be used to mitigate the curse of dimensionality, i.e. an insufficient number of observations compared with the number of features. Another approach is to merge similar features together: feature agglomeration. This can be achieved by clustering in the feature direction, in other words, by clustering the transposed data.

digits = datasets.load_digits()

images = digits.images

X = np.reshape(images, (len(images), -1))

connectivity = grid_to_graph(*images[0].shape)

agglo = cluster.FeatureAgglomeration(connectivity=connectivity, n_clusters=32)

>>> agglo.fit(X)

FeatureAgglomeration(affinity='euclidean', compute_full_tree='auto',...

X_reduced = agglo.transform(X)

X_approx = agglo.inverse_transform(X_reduced)

>>> images_approx = np.reshape(X_approx, images.shape)

Decompositions: Matrix Factorization and Dimensionality Reduction

PCA

 

The point cloud spanned by the observations above is very flat in one direction: one of the three univariate features can almost be exactly computed using the other two. PCA finds the directions in which the data is not flat.


>>> # Create a signal with only 2 useful dimensions

>>> x1 = np.random.normal(size=100)

>>> x2 = np.random.normal(size=100)

>>> x3 = x1 + x2

>>> X = np.c_[x1, x2, x3]

 

>>> from sklearn import decomposition

>>> pca = decomposition.PCA()

>>> pca.fit(X)

PCA(copy=True, n_components=None, whiten=False)

>>> print(pca.explained_variance_) 

[  2.18565811e+00   1.19346747e+00   8.43026679e-32]

>>> # As we can see, only the 2 first components are useful

>>> pca.n_components = 2

>>> X_reduced = pca.fit_transform(X)

>>> X_reduced.shape

(100, 2)

 

ICA-> Independent Component Analysis

ICA selects components so that the distribution of their loadings carries a maximum amount of independent information. It is able to recover non-Gaussian independent signals:


>>> # Generate sample data

>>> time = np.linspace(0, 10, 2000)

>>> s1 = np.sin(2 * time)  # Signal 1 : sinusoidal signal

>>> s2 = np.sign(np.sin(3 * time))  # Signal 2 : square signal

>>> S = np.c_[s1, s2]

>>> S += 0.2 * np.random.normal(size=S.shape)  # Add noise

>>> S /= S.std(axis=0)  # Standardize data

>>> # Mix data

>>> A = np.array([[1, 1], [0.5, 2]])  # Mixing matrix

>>> X = np.dot(S, A.T)  # Generate observations

 

>>> # Compute ICA

>>> ica = decomposition.FastICA()

>>> S_ = ica.fit_transform(X)  # Get the estimated sources

>>> A_ = ica.mixing_.T

>>> np.allclose(X,  np.dot(S_, A_) + ica.mean_)

True

 

 

Pipelining

Example: combining a transformer with a predictor.

The PCA does an unsupervised dimensionality reduction, while the logistic regression does the prediction.

from sklearn import linear_model, decomposition, datasets

from sklearn.pipeline import Pipeline

from sklearn.grid_search import GridSearchCV

logistic = linear_model.LogisticRegression()

pca = decomposition.PCA()

pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

digits = datasets.load_digits()

X_digits = digits.data

y_digits = digits.target

# Plot the PCA spectrum

pca.fit(X_digits)

plt.figure(1, figsize=(4, 3))

plt.clf()

plt.axes([.2, .2, .7, .7])

plt.plot(pca.explained_variance_, linewidth=2)

plt.axis('tight')

plt.xlabel('n_components')

plt.ylabel('explained_variance_')

 

# Prediction

n_components = [20, 40, 64]

Cs = np.logspace(-4, 4, 3)

#Parameters of pipelines can be set using ‘__’ separated parameter names:

estimator = GridSearchCV(pipe, dict(pca__n_components=n_components,logistic__C=Cs))

estimator.fit(X_digits, y_digits)

plt.axvline(estimator.best_estimator_.named_steps['pca'].n_components,

            linestyle=':', label='n_components chosen')

plt.legend(prop=dict(size=12))

 

 

Type Casting

1. Unless otherwise specified, the input is cast to float64

import numpy as np
from sklearn import random_projection

# sample data as in the scikit-learn tutorial
rng = np.random.RandomState(0)
X = rng.rand(10, 2000)
X = np.array(X, dtype='float32')
>>> X.dtype
dtype('float32')
 
>>> transformer = random_projection.GaussianRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> X_new.dtype
dtype('float64')

 

2. Regression targets are cast to float64; classification targets are kept as-is

# clf is assumed to be an SVC() and iris = datasets.load_iris(), as in the scikit-learn tutorial
clf.fit(iris.data, iris.target)
>>> list(clf.predict(iris.data[:3]))
[0, 0, 0]
clf.fit(iris.data, iris.target_names[iris.target])  
>>> list(clf.predict(iris.data[:3]))  
['setosa', 'setosa', 'setosa']
 

 

Numpy

1.array vs list

In Python, list is a built-in type whose elements may have different types, whereas all elements of a NumPy array must have the same type.
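
A tiny illustration (the values are arbitrary):

import numpy as np

mixed = [1, 'a', 3.0]           # a Python list can hold elements of different types
arr = np.array([1, 2, 3.0])     # np.array upcasts all elements to a single dtype
print(type(mixed[1]), arr.dtype)  # <class 'str'> float64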

2.np.unique(iris_y)
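
np.unique returns the sorted distinct values of an array; for the iris labels this is just the three class indices (a minimal sketch):

import numpy as np
from sklearn import datasets

iris_y = datasets.load_iris().target
print(np.unique(iris_y))   # [0 1 2]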

 

 

Pandas

 

Example

Working With Text Data - 20 newsgroups dataset

Load data

from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
# print twenty_train.target_names
# print len(twenty_train.data)
# print len(twenty_train.filenames)
# print twenty_train.data[0].split('\n')[:3]
# print twenty_train.target_names[twenty_train.target[0]]  # look up the category name from the target index
# print twenty_train.target[:10]
# print twenty_train.target_names

 

Extracting features: bag-of-words indices

# X as a numpy array of type float32 would require 10000 x 100000 x 4 bytes = 4GB in RAM
# which is barely manageable on today’s computers.
# Fortunately, most values in X will be zeros since for a given document less than a couple
# thousands of distinct words will be used.
# For this reason we say that bags of words are typically high-dimensional sparse datasets.
# scipy.sparse matrices are data structures that do exactly this,
# and scikit-learn has built-in support for these structures.

# Text preprocessing, tokenizing and filtering of stopwords are included in
# sklearn.feature_extraction.text.CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
print(X_train_counts.shape)

The tf-idf model

from sklearn.feature_extraction.text import TfidfTransformer

# tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
# X_train_tf = tf_transformer.transform(X_train_counts)
# print X_train_tf.shape
# print X_train_tf[0]
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print(X_train_tfidf.shape)

Classifier

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

pipeline

from sklearn.pipeline import Pipeline

text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB()), ])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

Evaluation

import numpy as np

twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
print(np.mean(predicted == twenty_test.target))

Other classifiers

from sklearn.linear_model import SGDClassifier

text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, n_iter=5, random_state=42)), ])
_ = text_clf.fit(twenty_train.data, twenty_train.target)
predicted = text_clf.predict(docs_test)
print(np.mean(predicted == twenty_test.target))

Confusion matrix: detailed performance analysis

from sklearn import metrics

print(metrics.classification_report(twenty_test.target, predicted, target_names=twenty_test.target_names))

print(metrics.confusion_matrix(twenty_test.target, predicted))

Parameter tuning

GridSearchCV

from sklearn.grid_search import GridSearchCV

parameters = {'vect__ngram_range': [(1, 1), (1, 2)],'tfidf__use_idf': (True, False),'clf__alpha': (1e-2, 1e-3), }
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=1)
gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])
print(twenty_train.target_names[gs_clf.predict(['God is love'])[0]])
best_parameters, score, _ = max(gs_clf.grid_scores_, key=lambda x: x[1])
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))
print(score)

